Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Num Shuffle Files : Gets the count of shuffle files pulled into memory for a filter condition #76

Merged

Conversation

joydeepbroy-zeotap
Copy link
Collaborator

@joydeepbroy-zeotap joydeepbroy-zeotap commented Jul 17, 2023

This PR has two methods:
getNumShuffleFiles and getShuffleFileMetadata

First, get the count of shuffle files pulled into memory when a particular condition is provided in merge for a Delta Table.
The second one gets all the file metadata for the parquet files.

Comment on lines 113 to 125
def getShuffleFileMetadata(path: String, condition: String)
= {
val (deltaLog, unresolvedColumns, targetOnlyPredicates, minMaxOnlyExpressions, equalOnlyExpressions,otherExpressions, removedPredicates) = getResolvedExpressions(path, condition)
deltaLog.withNewTransaction { deltaTxn =>
(deltaTxn.filterFiles(targetOnlyPredicates),
deltaTxn.filterFiles(minMaxOnlyExpressions),
deltaTxn.filterFiles(equalOnlyExpressions),
deltaTxn.filterFiles(otherExpressions),
deltaTxn.filterFiles(removedPredicates),
deltaLog.snapshot.filesWithStatsForScan(Nil),
unresolvedColumns)
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not add any test cases for this method as the file metadata received is directly from Delta Table and I am not making any mutations to it. Additionally, please review if this method would be useful. I can remove it as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any use case in mind but I think it may be of help for someone investigating the behavior of a query. My only advice here would be to explicitly add the return type in the method signature so it can serve as reference for someone that potentially would like to use this method.

README.md Outdated

then the output might look like
```scala
(18, 100, 300, 600, 800, 800, List())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this output be a map that indicates what each value represents?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the output to a MAP and updated the method comments. Please check if it looks good, then I will update README.md @MrPowers

@MrPowers
Copy link
Collaborator

LGTM. The list with multiple outputs might be hard to parse, but the functionality looks cool. We should definitely write a blog post on this one.

Copy link
Collaborator

@brayanjuls brayanjuls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, I just added a couple of comments.

Comment on lines 113 to 125
def getShuffleFileMetadata(path: String, condition: String)
= {
val (deltaLog, unresolvedColumns, targetOnlyPredicates, minMaxOnlyExpressions, equalOnlyExpressions,otherExpressions, removedPredicates) = getResolvedExpressions(path, condition)
deltaLog.withNewTransaction { deltaTxn =>
(deltaTxn.filterFiles(targetOnlyPredicates),
deltaTxn.filterFiles(minMaxOnlyExpressions),
deltaTxn.filterFiles(equalOnlyExpressions),
deltaTxn.filterFiles(otherExpressions),
deltaTxn.filterFiles(removedPredicates),
deltaLog.snapshot.filesWithStatsForScan(Nil),
unresolvedColumns)
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any use case in mind but I think it may be of help for someone investigating the behavior of a query. My only advice here would be to explicitly add the return type in the method signature so it can serve as reference for someone that potentially would like to use this method.

Copy link
Collaborator

@brayanjuls brayanjuls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@joydeepbroy-zeotap joydeepbroy-zeotap merged commit 96060b2 into mrpowers-io:main Sep 12, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants